Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Fadeeva, Ekaterina, Goloburda, Maiya, Rubashevskii, Aleksandr, Vashurin, Roman, Shelmanov, Artem, Nakov, Preslav, Sachan, Mrinmaya, Panov, Maxim
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling and measure the level of agreement among them. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the probability mass of the beam set under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
Wu, Shaokai, Ji, Yanbiao, Li, Qiuchang, Zhang, Zhiyi, He, Qichen, Xie, Wenyuan, Zhang, Guodong, Bayramli, Bayram, Ding, Yue, Lu, Hongtao
Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire additional knowledge that would enhance task performance. In this paper, we propose Dejavu, a general post-deployment learning framework that employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN identifies contextually relevant prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic-similarity rewards to train EFN, ensuring that the predicted actions align with past behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to "learn from experience". Experiments across diverse embodied tasks show that EFN improves adaptability, robustness, and success rates over frozen baselines. We provide code and demos in our supplementary material.
HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
Assadi, Adnan El, Chung, Isaac, Solomatin, Roman, Muennighoff, Niklas, Enevoldsen, Kenneth
Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling notably on low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine LLMs as annotators on reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
Martynov, Nikita, Mordasheva, Anastasia, Gorbetskiy, Dmitriy, Astafurov, Danil, Isaeva, Ulyana, Basyrova, Elina, Skachkov, Sergey, Berestova, Victoria, Ivanov, Nikolay, Zanina, Valeriia, Fenogenova, Alena
The full statistics for all criteria, grouped by panel assignment, are presented in Table 7. Tables 8 and A.1 report the statistics of the generated scores and rationales for criteria annotation. As we can see, the distributions of criterion-based scores for most criteria are largely comparable between the expert-written and synthetic datasets, despite the underlying evaluated instruction-answer pairs being entirely distinct and non-overlapping. This is particularly evident in the mean, standard deviation, and mode of the scores, which, across a wide range of criteria types, demonstrate close alignment, suggesting that criterion-level assessment remains consistent across both data sources. Tables 8 and A.1 also suggest that the synthetically generated texts (both instructions and rationales) are lengthier yet less original than those written by the experts. The tables further show that DeepSeek-R1 tends to assign a middling score of 1 rather than choosing extreme values. Despite these statistical and stylistic differences in commentary, the synthetic dataset remains a viable resource for training the LLM-as-a-Judge family, especially given the overall similarity in criterion-based scores. Thus, while the expert-written feedback exhibits optimized brevity and contextual appropriateness, the synthetic commentary maintains an adequate level of informativeness and coherence.
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Chervyakov, Artem, Kharitonov, Alexander, Zadorozhny, Pavel, Adamenko, Pavel, Levichev, Rodion, Vorobev, Dmitrii, Salikhov, Dmitrii, Valeev, Aidar, Pestova, Alena, Dziuba, Maria, Alimova, Ilseyar, Zavgorodnev, Artem, Medvedev, Aleksandr, Moiseev, Stanislav, Bruches, Elena, Grebenkin, Daniil, Derunets, Roman, Vikulov, Vladimir, Emelyanov, Anton, Babaev, Dmitrii, Ivanov, Vladimir V., Malykh, Valentin, Fenogenova, Alena
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding the true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA Code to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
Fast-DataShapley: Neural Modeling for Training Data Valuation
Sun, Haifeng, Xiong, Yu, Wu, Runze, Cai, Xinyu, Fan, Changjie, Zhang, Lan, Li, Xiang-Yang
The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers' legitimate rights and fairly reward them for their contributions. The Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent Shapley-value-based works attempt to mitigate this computational complexity with approximation algorithms. However, they must retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time inference speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two angles: approximate calculation of the utility function and grouped calculation of the training data. We analyze time complexity to show the efficiency of our methods. Experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, performance more than doubles, and the explainer's training speed increases by two orders of magnitude.
LAOF: Robust Latent Action Learning with Optical Flow Constraints
Bu, Xizhou, Lyu, Jiexi, Sun, Fulei, Yang, Ruichen, Ma, Zhiqiang, Li, Wei
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints (LAOF), a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10%. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1% of action labels. Code can be found at https://github.com/XizoB/LAOF
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Gu, Ken, Zhang, Zhihan, Lin, Kate, Zhang, Yuwei, Paruchuri, Akshay, Yu, Hong, Kazemi, Mehran, Ayush, Kumar, Heydari, A. Ali, Xu, Maxwell A., Narayanswamy, Girish, Liu, Yun, Poh, Ming-Zher, Yang, Yuzhe, Malhotra, Mark, Patel, Shwetak, Palangi, Hamid, Xu, Xuhai, McDuff, Daniel, Althoff, Tim, Liu, Xin
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as table size increases. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
Semantic Label Drift in Cross-Cultural Translation
Kabir, Mohsinul, Ahmed, Tasnim, Rahman, Md Mezbaur, Giannouris, Polydoros, Ananiadou, Sophia
Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels drift or are altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions
Kang, Sungmin, Bakman, Yavuz Faruk, Yaldiz, Duygu Nur, Buyukates, Baturalp, Avestimehr, Salman
The rapid advancement of large language models (LLMs) has transformed the landscape of natural language processing, enabling breakthroughs across a wide range of areas including question answering, machine translation, and text summarization. Yet, their deployment in real-world applications has raised concerns over reliability and trustworthiness, as LLMs remain prone to hallucinations that produce plausible but factually incorrect outputs. Uncertainty quantification (UQ) has emerged as a central research direction to address this issue, offering principled measures for assessing the trustworthiness of model generations. We begin by introducing the foundations of UQ, from its formal definition to the traditional distinction between epistemic and aleatoric uncertainty, and then highlight how these concepts have been adapted to the context of LLMs. Building on this, we examine the role of UQ in hallucination detection, where quantifying uncertainty provides a mechanism for identifying unreliable generations and improving reliability. We systematically categorize a wide spectrum of existing methods along multiple dimensions and present empirical results for several representative approaches. Finally, we discuss current limitations and outline promising future research directions, providing a clearer picture of the current landscape of LLM UQ for hallucination detection.